The Quill Distributed Analytics Library and Platform

Authors

  • Badrish Chandramouli
  • Raul Castro Fernandez
  • Jonathan Goldstein
  • Ahmed Eldawy
  • Abdul Quamar
Abstract

This technical report introduces Quill (short for "a quadrillion tuples per day"), a library and distributed platform for relational and temporal analytics over large datasets in the cloud. Quill exposes a new abstraction for parallel datasets and computation, called ShardedStreamable. This abstraction makes it possible to express efficient distributed physical query plans that are transferable, i.e., movable from offline to real-time settings and vice versa. ShardedStreamable decouples three concerns: the specification of incremental query logic, a small but rich set of data movement operations, and keying. This decoupling allows Quill to express a broad space of plans with complex querying functionality while leveraging existing temporal libraries such as Trill. Quill's layered architecture carefully separates responsibilities into independently useful components while retaining high performance. We built Quill for the cloud with a master-less design, in which a language-integrated client library communicates and coordinates directly with cloud workers using off-the-shelf distributed cloud components such as queues. Experiments on up to 400 cloud machines and on datasets of up to 1 TB find that Quill incurs low overheads and outperforms SparkSQL by up to orders of magnitude on temporal queries and by up to 6× on relational queries, while supporting a rich space of transferable, programmable, and expressive distributed physical query plans.
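The separation of query logic, data movement, and keying described in the abstract can be illustrated with a small toy model. The sketch below is not Quill's API: the class `ToyShardedDataset`, its methods `query`, `rekey`, and `redistribute`, and the click-count data are all hypothetical names invented for illustration, and the code runs in a single process rather than on cloud workers. It only shows how a physical plan can be composed from three independent kinds of steps.

```python
from collections import defaultdict
import zlib


class ToyShardedDataset:
    """Hypothetical single-process stand-in for a sharded dataset (not Quill's API)."""

    def __init__(self, shards):
        # Each shard is a list of (key, payload) pairs.
        self.shards = [list(s) for s in shards]

    def query(self, fn):
        # Query logic: run fn on every shard independently; no data moves.
        return ToyShardedDataset([fn(s) for s in self.shards])

    def rekey(self, key_fn):
        # Keying: assign a new key to every record; no data moves.
        return ToyShardedDataset(
            [[(key_fn(p), p) for _, p in s] for s in self.shards]
        )

    def redistribute(self, num_shards=None):
        # Data movement: route each record to the shard selected by its key hash,
        # so all records with the same key end up on the same shard.
        n = num_shards or len(self.shards)
        out = [[] for _ in range(n)]
        for s in self.shards:
            for k, p in s:
                out[zlib.crc32(repr(k).encode()) % n].append((k, p))
        return ToyShardedDataset(out)


def sum_clicks(shard):
    # Per-shard aggregation: total clicks for each key present in this shard.
    totals = defaultdict(int)
    for k, p in shard:
        totals[k] += p["clicks"]
    return sorted(totals.items())


# Illustrative input: two shards of unkeyed click records.
data = ToyShardedDataset([
    [(None, {"user": "a", "clicks": 1}), (None, {"user": "b", "clicks": 2})],
    [(None, {"user": "a", "clicks": 3})],
])

# A physical plan built from the three decoupled steps.
plan = data.rekey(lambda p: p["user"]).redistribute().query(sum_clicks)
result = dict(kv for shard in plan.shards for kv in shard)
```

Because `redistribute` co-locates records that share a key, the per-shard aggregation in `query` produces global per-user totals. Swapping the order or choice of these steps yields a different physical plan without touching the aggregation logic, which is the kind of flexibility the abstract attributes to decoupling.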


Similar resources

Quill: Efficient, Transferable, and Rich Analytics at Scale

This paper introduces Quill (stands for a quadrillion tuples per day), a library and distributed platform for relational and temporal analytics over large datasets in the cloud. Quill exposes a new abstraction for parallel datasets and computation, called ShardedStreamable. This abstraction provides the ability to express efficient distributed physical query plans that are transferable, i.e., m...


A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....


CARDAP: A Scalable Energy-Efficient Context Aware Distributed Mobile Data Analytics Platform for the Fog

Distributed online data analytics has attracted significant research interest in recent years with the advent of Fog and Cloud computing. The popularity of novel distributed applications such as crowdsourcing and crowdsensing have fostered the need for scalable energy-efficient platforms that can enable distributed data analytics. In this paper, we propose CARDAP, a (C)ontext (A)ware (R)eal-tim...


Ophidia: Toward Big Data Analytics for eScience

This work introduces Ophidia, a big data analytics research effort aiming at supporting the access, analysis and mining of scientific (n-dimensional array based) data. The Ophidia platform extends, in terms of both primitives and data types, current relational database system implementations (in particular MySQL) to enable efficient data analysis tasks on scientific array-based data. To enable ...


Scalable Analytics over Distributed Time-series Graphs using GoFFish

Graphs are a key form of Big Data, and performing scalable analytics over them is invaluable to many domains. As our ability to collect data grows, there is an emerging class of inter-connected data which accumulates or varies over time, and on which novel analytics – both over the network structure and across the time-variant attribute values – is necessary. We introduce the notion of time-ser...



Publication date: 2016